Enhancing Document Clustering Using Hybrid Models for Semantic Similarity

Authors

  • Ahmed K. Farahat
  • Mohamed S. Kamel
Abstract

Different document representation models have been proposed to measure semantic similarity between documents using corpus statistics. Some of these models explicitly estimate semantic similarity based on measures of correlation between terms, while others apply dimension reduction techniques to obtain a latent representation of concepts. This paper proposes new hybrid models that combine explicit and latent analysis to estimate semantic similarity between documents. The proposed models have been used to enhance the performance of document clustering algorithms. Experiments on thirteen benchmark data sets show that hybrid models achieve a significant improvement in clustering performance when used with clustering algorithms that are sensitive to errors in estimating document similarity.
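The two model families named in the abstract can be illustrated with a small sketch. It is not the authors' formulation: the explicit model is assumed here to map documents through a term-term co-occurrence matrix, the latent model is assumed to be an LSA-style truncated SVD, and the function names and toy corpus are hypothetical. The paper's hybrid models are not reproduced, because the abstract does not specify how the two analyses are combined.

```python
import numpy as np

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    U = M / norms
    return U @ U.T

def explicit_similarity(X):
    """Explicit model (assumed form): weight terms by corpus co-occurrence."""
    G = X.T @ X  # term-term correlation estimated from the corpus itself
    return cosine_matrix(X @ G)

def latent_similarity(X, k):
    """Latent model (assumed form): keep the top-k SVD dimensions (LSA-style)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return cosine_matrix(U[:, :k] * s[:k])

# Hypothetical toy corpus: rows are documents, columns are the terms
# (car, auto, engine, fruit, apple).
X = np.array([
    [1, 0, 1, 0, 0],  # d0: car engine
    [0, 1, 1, 0, 0],  # d1: auto engine
    [0, 1, 0, 0, 0],  # d2: auto
    [0, 0, 0, 1, 1],  # d3: fruit apple
], dtype=float)

raw = cosine_matrix(X)              # plain bag-of-words cosine
explicit = explicit_similarity(X)   # term-correlation based
latent = latent_similarity(X, k=2)  # reduced-dimension based
```

In this toy corpus d0 and d2 share no terms, so their bag-of-words cosine is zero, yet both assumed models assign them a positive similarity because "car" and "auto" each co-occur with "engine". The unrelated document d3 stays at (numerically) zero similarity to d0 in both.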

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Domain Based Punjabi Text Document Clustering

Text clustering is a text mining technique used to group similar documents into a single cluster by using some sort of similarity measure and separating the dissimilar documents. Popular clustering algorithms available for text clustering treat a document as a conglomeration of words. The syntactic or semantic relations between words are not given any consideration. Many different algorithms ...

Hybrid Document Indexing with Spectral Embedding

Document representation has a large impact on the performance of document retrieval and clustering algorithms. We propose a hybrid document indexing scheme that combines the traditional bag-of-words representation with spectral embedding. This method accounts for the specifics of the document collection and also uses semantic similarity information based on a large-scale statistical analysis. Cl...

An Efficient Document Clustering Based on HUBNESS Proportional K-Means Algorithm

Evaluating similarity between documents is a main operation in the text processing field. Similarity measurement is used to estimate the relationship between records or documents. In the existing system, similarity between two documents can be computed with respect to a feature by using the Similarity Measure for Text Processing (SMTP). In proposed hybrid SMTP scheme is integrated with hubness base...

Hybrid Approach for Punjabi Text Clustering

Text clustering is a text mining technique used to group similar documents into a single cluster by using some sort of similarity measure and placing dissimilar documents into different clusters. Most of the popular clustering algorithms treat a document as a conglomeration of words and do not consider the syntactic or semantic relations between words. To overcome this drawback, some algori...

Publication date: 2010